DOCUMENT RESUME

ED 082 469                                        EM 011 428

AUTHOR        Munick, Herman; Allison, John
TITLE         On Uses and Misuses of Computer Programs in Statistics.
PUB DATE      Jun 73
NOTE          9p.; Paper presented at the Conference on Computers
              in the Undergraduate Curricula (4th, Claremont,
              California, June 18-20, 1973)
EDRS PRICE    MF-$0.65 HC-$3.29
DESCRIPTORS   *Computer Assisted Instruction; *Computer Programs;
              Correlation; *Statistical Analysis; *Statistics
IDENTIFIERS   Autocorrelation; Distributions; Linear Regressions;
              Normal Curve; Normal Distribution; T Values



ABSTRACT 

Distributions and linear regressions are discussed.
The section dealing with the former topic emphasizes the usefulness
of computer programs in statistics, demonstrating their ability to
handle tedious and time-consuming tasks. The normal curve is stressed
since the assumption of a normal distribution is common. An example
of 200 data points is presented which illustrates a computer
program's ability to do descriptive statistics for data not grouped
and for data distributed into any number of classes of equal width,
and to test whether the 200 points are normally distributed. The section
on linear regression emphasizes how computer programs can lead to
erroneous results. In the area of linear regression there are in current
use "canned programs" which display the t-values corresponding to the
coefficients of the least square estimate, but do not take
autocorrelation into account. In the presence of autocorrelation it
is incorrect to use these t-values. An example is presented in which
autocorrelation is present but can be removed by a suitable
transformation of variables. (Author)






ON USES AND MISUSES OF COMPUTER PROGRAMS IN STATISTICS









Dr. Herman Munick*
College of Business
St. John's University
Jamaica, New York 11432
(212) 969-8000

John Allison
940 East 27th Street
Brooklyn, New York 11210
(212) 252-7056



This paper is divided into two sections, DISTRIBUTIONS and LINEAR REGRESSION. The
first section, DISTRIBUTIONS, emphasizes the usefulness of computer programs in statistics.
Some very tedious and time-consuming tasks are made considerably easier and clearer by use
of these computer programs. In particular, the normal curve is emphasized since in many
areas of statistics the assumption "a population is normally distributed" appears time and
time again. An example of 200 data points is presented illustrating the computer program's
ability to do descriptive statistics for data not grouped and for data grouped into any
number of classes of equal width, and to test whether the 200 points are normally distributed. A
graph of the data also appears in the output.

The second section, LINEAR REGRESSION, emphasizes how computer programs can lead to
erroneous results. In the area of linear regression there are in current use "canned
programs" which display the t-values corresponding to the coefficients of the least square
estimate, but do not take autocorrelation into account. In the presence of autocorrelation
it is incorrect to use these t-values. An example of this is presented where autocorrelation
is present but can be removed by a suitable transformation of variables.



Distributions 

The salient features of a program written for this section are the following:

1. Descriptive statistics for data which is not grouped.

2. Indicators that a set of points is normally distributed (necessary but not
sufficient conditions).

3. Grouping of the original data into any number of classes of equal width, with descriptive
statistics.

4. A graph of the grouped data (frequency plotted against midpoint of class).

5. CHI-SQUARE test for goodness of fit of the normal curve.
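The first two features can be sketched in a few lines. This is a minimal illustration, not the authors' program: the function name `describe` is hypothetical, the moments are divided by n (the population form) as an assumption, and the ten sample values are taken from the last DATA line of the example's listing. The "measure of symmetry" and "measure of peakedness" are taken here to be the standardized third and fourth moments, which equal 0 and 3 for a normal distribution.

```python
import math

def describe(values):
    """Descriptive statistics for data that is not grouped.

    Assumption: central moments are divided by n (population form).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2)
    return {
        "n": n,
        "mean": mean,
        "variance": m2,
        "std_dev": sd,
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "symmetry": m3 / sd ** 3,      # near 0 for a normal sample
        "peakedness": m4 / m2 ** 2,    # near 3 for a normal sample
    }

# Ten of the 200 observations (the last DATA line of the listing).
sample = [4.15, 4.08, 4.29, 4.07, 4.01, 4.25, 4.03, 4.27, 4.07, 4.17]
stats = describe(sample)
for key, value in stats.items():
    print(key, value)
```

Ten points are far too few to judge normality; the sketch only shows how the quantities are formed.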



U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE
NATIONAL INSTITUTE OF EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.

FILMED FROM BEST AVAILABLE COPY



Example



The following is a display of the input and output of the program for the data points
indicated:



9900 DATA [illegible],4.48,4.45,4.2,4.14,4.1,4.08,4,4.39
9901 DATA [illegible],4.29,4.28,4.13,4.33,4.37,4.43,4.59,4.1
9902 DATA [illegible],4.33,4.24,4.2,4.17,4.28,[illegible],4.16,4.5
9903 DATA [illegible],4.53,4.55,4.44,4.02,4.2,4.13,4.16
9904 DATA [illegible],4.18,3.98,3.96,4.17,4.2,3.99
9905 DATA [illegible],4.59,4.54,4.15,4.21,4.42,4.33,4.34
9906 DATA [illegible],4.33,4.12,4.46,4.27,4.21,4.2,4.05
9907 DATA [illegible],4.08,4.12,4.1,4.15,4.07,4.25,4.3,4.29
9908 DATA [illegible],4.2,4.31,4.09,4.23,4.2,4.19,4,4.39
9909 DATA [illegible],4.42,4.38,4.11,4.13,3.68,4.23,4.23
9910 DATA [illegible],4.19,3.99,4.07,3.76,4.12,4,4.09
9911 DATA 3.34,4.13,3.37,4.21,4.19,4.28,4.02,4.02,4.24,4.07
9912 DATA 4.35,4.27,4.05,4.14,4.16,4.13,4.19,4.05,3.91,3.89
9913 DATA [illegible],4.03,4.07,3.98,4.2,4.1,4.13,4.07,4.28,4.05
9914 DATA 4.45,4.24,4.08,4.09,3.95,3.95,3.87,4.13,4.09,3.97,3.95
9915 DATA 3.93,4.25,4.23,3.99,3.95,4,3.95,4.08,4.13,4.13
9916 DATA 4.36,4.35,4.06,4.2,4.17,3.88,3.99,3.98,4.1,3.88
9917 DATA 4.02,4.18,4.35,4.08,4.06,3.92,3.9,4.16,4.14,4.06
9918 DATA 4.25,4.41,4.07,4.08,4.04,4.1,4.21,3.87,3.9,3.93
9919 DATA 4.15,4.08,4.29,4.07,4.01,4.25,4.03,4.27,4.07,4.17
9998 DATA -1
9999 END















CHOOSE WHAT INFORMATION YOU NEED

A) NOT-GROUPED DATA

B) GROUPED DATA

C) BOTH

D) NONE OF THE ABOVE

ANSWER A,B,C, OR D
?C

HOW MANY CLASSES THERE?
?9

WHAT IS THE MINIMUM NUMBER OF THE FIRST CLASS?
?3.67

WHAT IS THE MAXIMUM NUMBER OF THE LAST CLASS?
?4.75



STATISTICS FOR 'NOT-GROUPED' DATA



THERE WERE 200 OBSERVATIONS MADE

THE AVERAGE IS 4.1714

THE VARIANCE IS 3.09021E-02

THE STANDARD DEVIATION IS .17579

THE MAXIMUM VALUE IS 4.7

THE MINIMUM VALUE IS 3.68

THE RANGE OF THE OBSERVATIONS IS 1.02

THE MEASURE OF SYMMETRY IS .260124

THE MEASURE OF PEAKEDNESS IS 3.10359



THE FIRST DEVIATION BAND BETWEEN 3.99561 AND 4.34719
CONTAINS 68% OF THE OBSERVATIONS

THE SECOND DEVIATION BAND BETWEEN 3.81982 AND 4.52298
CONTAINS 95% OF THE OBSERVATIONS

THE THIRD DEVIATION BAND BETWEEN 3.64403 AND 4.69877
CONTAINS 99.5% OF THE OBSERVATIONS
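The deviation bands in this output are the intervals of one, two, and three standard deviations about the average. The sketch below rebuilds the band limits from the printed average and standard deviation and shows how the share of observations inside a band could be counted. The helper names `band` and `share_within` are hypothetical, and the short list passed to `share_within` is only a stand-in for the 200 observations.

```python
MEAN = 4.1714      # average printed by the program
SD = 0.17579       # standard deviation printed by the program

def band(k, mean=MEAN, sd=SD):
    """Limits of the k-th deviation band: mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)

def share_within(values, k, mean=MEAN, sd=SD):
    """Fraction of the values that fall inside the k-th deviation band."""
    low, high = band(k, mean, sd)
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)

for k in (1, 2, 3):
    low, high = band(k)
    print(f"band {k}: {low:.5f} to {high:.5f}")

# Stand-in data (ten of the observations), for illustration only.
subset = [4.15, 4.08, 4.29, 4.07, 4.01, 4.25, 4.03, 4.27, 4.07, 4.17]
print(share_within(subset, 1))
```

For a normal population the three shares are close to 68%, 95%, and 99.7%, which is the comparison the program's output invites.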






STATISTICS FOR 'GROUPED' DATA



CLASS BOUNDARIES

FROM      TO        MIDPOINT    FREQUENCY

3.67      3.79      3.73         3

3.79      3.91      3.85         9

3.91      4.03      3.97        28

4.03      4.15      4.09        54

4.15      4.27      4.21        51

4.27      4.39      4.33        31

4.39      4.51      4.45        17

4.51      4.63      4.57         6

4.63      4.75      4.69         1



[Graph: frequency (horizontal scale 0 to 100) plotted against class midpoint, 3.73 through 4.69.]

THE NUMBER OF OBSERVATIONS MADE WERE 200

THE AVERAGE IS 4.1728

THE MODE IS 4.09

THE VARIANCE IS 3.20242E-02

THE STANDARD DEVIATION IS .178953

THE MEDIAN IS 4.16412
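The grouped-data figures above can be reproduced from the frequency table alone. The sketch below is an illustration rather than the authors' code; it assumes every observation in a class sits at the class midpoint, divides the variance by n, takes the mode as the midpoint of the fullest class, and finds the median by linear interpolation inside the class that contains the 100th observation.

```python
import math

LOWER = 3.67                 # minimum of the first class
WIDTH = 0.12                 # (4.75 - 3.67) / 9 classes
freqs = [3, 9, 28, 54, 51, 31, 17, 6, 1]

n = sum(freqs)
mids = [LOWER + WIDTH * (k + 0.5) for k in range(len(freqs))]

mean = sum(f * m for f, m in zip(freqs, mids)) / n
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / n
std_dev = math.sqrt(variance)
mode = mids[freqs.index(max(freqs))]

# Median: walk the cumulative frequencies to the class holding n/2,
# then interpolate linearly within that class.
half, cumulative = n / 2, 0
for k, f in enumerate(freqs):
    if cumulative + f >= half:
        median = LOWER + WIDTH * k + (half - cumulative) / f * WIDTH
        break
    cumulative += f

print(n, round(mean, 4), round(variance, 7), round(std_dev, 6))
print(round(mode, 2), round(median, 5))
```

Under these assumptions the results agree with the printed average, variance, standard deviation, mode, and median to the digits shown.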



O(I)      E(I)        (O(I)-E(I))^2/E(I)

  3        2.568      .072673
  9       10.696      .268924
 28       28.416      6.09010E-03
 54       48.192      .699968
 51       52.202      2.76773E-02
 31       36.108      .722601
 17       15.958      6.80388E-02
  6        4.498      .501557
  1         .8088     4.51996E-02

TOTAL                 2.41273 (Computed Chi-square value)
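The computed chi-square value is the sum of (O - E)^2 / E over the nine classes. The sketch below recomputes it from the printed observed and expected frequencies. The degrees of freedom and critical values are assumptions not stated in the paper: with nine classes and two parameters (average and standard deviation) estimated from the data, 9 - 1 - 2 = 6 degrees of freedom would be usual, for which standard tables give about 12.59 at the 5 percent level and 16.81 at the 1 percent level.

```python
observed = [3, 9, 28, 54, 51, 31, 17, 6, 1]
expected = [2.568, 10.696, 28.416, 48.192, 52.202, 36.108, 15.958, 4.498, 0.8088]

# Contribution of each class, then the total.
terms = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(terms)

# Assumed: 6 degrees of freedom; critical values from a standard table.
CRITICAL_5_PERCENT = 12.592
CRITICAL_1_PERCENT = 16.812

fit_ok_5 = chi_square < CRITICAL_5_PERCENT
fit_ok_1 = chi_square < CRITICAL_1_PERCENT

print(round(chi_square, 5), fit_ok_5, fit_ok_1)
```

A computed value below the critical value means the hypothesis of a normal fit is not rejected at that level.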



1. The usual measures such as average, variance, and standard deviation are obtained.
In addition the maximum value, minimum value and range are determined.

2. Preliminary indications are that the set of 200 numbers is normally distributed,
since the measure of symmetry is approximately zero, the peakedness is approximately 3,
and the one, two, and three standard deviation bands contain respectively 68%, 95%, and
99.5% of the data.

3. Nine classes were asked for, and the grouped data for these nine classes is
presented with the midpoint of each class. The usual measures such as average,
variance, standard deviation, median and mode are given for the data grouped into
these nine classes. It is noted that the average and standard deviation for the
grouped data (4.1728 and .178953) are approximately the same as for the data when not
grouped (4.1714 and .17579).

4. The graph indicates the data is approximately symmetric about the average value of
4.1714.

5. The CHI-SQUARE value of 2.41273 indicates that the fit of the normal curve (average of
4.1728 and standard deviation of .178953) is good at both the 5 percent and 1 percent
levels of significance (O(I) is the observed frequency and E(I) is the expected frequency).



The program is particularly valuable in that it will offer descriptive statistics for
any number of classes asked for. In some realistic situations it is important to compare
results using different numbers of classes, and this can now be done with considerable ease.
The graph of the data is of particular importance in that one can quickly get an initial
impression of how the data behaves. The program emphasizes the normal distribution. The
reason for this is that in such topics as linear regression (second section of this paper) an
assumption often stated is that a set of points is normally distributed.



Linear Regression

An assumption in the linear regression model is that the error terms are statistically
independent. A violation of this assumption is referred to as a problem of autocorrelation, and it invalidates
use of the usual t-tests. There are "canned programs" currently available which
display the t-values corresponding to the coefficients of the least square estimate, but do
not take autocorrelation into account. In the presence of autocorrelation it is therefore
incorrect to use these t-values [2, p. 80].

Under suitable conditions, depending on the given data, autocorrelation can be
eliminated by transformation to a new set of variables. In the transformed system it is
valid to use the t-values. This subject is treated clearly by Frank [3]. A typical case
where autocorrelation is not considered is Parker and Segura's article [4], an interesting
and useful paper. In their article, sales are forecast in the home furnishing industry using
new marriages during the year, housing starts, annual disposable income, and time trend as
explanatory variables, using 24 years of data. This is an example of time-series data, where
one must be especially careful about the presence of autocorrelation. In the transformed
system it is demonstrated that autocorrelation is eliminated and the least square estimate
provides a reasonably good estimate of the original set of data, with the t-values significant
at approximately a 5% level of significance.






GIVEN 24 YEARS OF DATA

        Housing        Disposable          New             Company
        starts (H)     personal income     marriages (M)   sales (S)       Time
Year    (thousands)    (I) ($ billions)    (thousands)     ($ millions)    (T)

1947      744          158.9               2,291            92.920          1
1948      942          169.5               1,991           122.440          2
1949    1,033          188.3               1,811           125.570          3
1950    1,138          187.2               1,580           110.460          4
1951    1,549          205.8               1,667           139.400          5
1952    1,211          224.9               1,595           154.020          6
1953    1,251          235.0               1,539           157.590          7
1954    1,225          247.9               1,546           152.230          8
1955    1,354          254.4               1,490           139.130          9
1956    1,475          274.4               1,531           156.330         10
1957    1,240          292.9               1,585           140.470         11
1958    1,157          308.5               1,518           128.240         12
1959    1,341          318.8               1,451           117.450         13
1960    1,531          337.7               1,494           132.640         14
1961    1,274          350.0               1,527           126.160         15
1962    1,274          364.4               1,547           116.990         16
1963    1,469          385.3               1,580           123.900         17
1964    1,615          404.6               1,654           141.320         18
1965    1,538          436.6               1,719           156.710         19
1966    1,488          469.1               1,789           171.930         20
1967    1,173          505.3               1,844           184.790         21
1968    1,299          546.3               1,913           202.700         22
1969    1,524          590.0               2,059           237.340         23
1970    1,479          629.6               2,132           254.930         24



Define the following for year i:

$H_i$ = Housing Starts
$I_i$ = Annual Disposable Income
$M_i$ = New Marriages
$S_i$ = Gross Sales
$T_i$ = Time Trend
$i = 1, 2, \ldots, 24$

Presence of autocorrelation in error terms

Using a linear regression computer program the following least square estimate is
obtained:

$S = 50.605 + .036H + 1.221I - .068M - 19.483T$    (1)

The following table is then established:



                        ESTIMATED
 i      GROSS SALES     GROSS SALES     ERROR TERMS

 1       92.920          95.673          -2.752
 2      122.440         116.729           5.711
 3      125.570         135.752         -10.181
 4      110.460         134.467         -24.006
 5      139.400         146.578          -7.178
 6      154.020         143.129          10.891
 7      157.590         141.237          16.354
 8      152.230         136.085          16.145
 9      139.130         133.008           6.122
10      156.330         139.506          16.824
11      140.470         130.447          10.024
12      128.240         131.581          -3.341
13      117.450         135.877         -18.427
14      132.640         143.384         -10.744
15      126.160         127.395          -1.236
16      116.990         126.037          -9.047
17      123.900         134.937         -11.037
18      141.320         139.232           2.088
19      156.710         151.599           5.111
20      171.930         165.210           6.720
21      184.790         174.802           9.988
22      202.700         205.204          -2.504
23      237.340         238.319           -.979
24      254.930         259.474          -4.544
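The estimated values and error terms in a table of this kind come from an ordinary least squares fit. The sketch below is not the "canned program" the authors used: it is a small pure-Python illustration that solves the normal equations by Gauss-Jordan elimination, with hypothetical function names `ols` and `residuals`, run on made-up numbers so the answer is known in advance. For real data a tested library routine would be the safer choice, since the normal equations can be numerically delicate.

```python
def ols(rows, y):
    """Least square coefficients [b0, b1, ..., bk] from the normal equations.

    rows: one tuple of explanatory values per observation.
    y:    the dependent variable, same length as rows.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]   # leading 1 = constant term
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def residuals(rows, y, coef):
    """Error terms: observed value minus fitted value."""
    return [yi - (coef[0] + sum(c * v for c, v in zip(coef[1:], r)))
            for r, yi in zip(rows, y)]

# Made-up data generated exactly from y = 1 + 2a - 3b.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [1 + 2 * a - 3 * b for a, b in rows]
coef = ols(rows, y)
errors = residuals(rows, y, coef)
print([round(c, 6) for c in coef])
```

The error terms returned by `residuals` are the quantities on which a test for autocorrelation operates.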



The t-values corresponding to the coefficients of the least square estimate are not displayed
here, since it will be shown that first order autocorrelation is present,
invalidating the use of these t-values. The main point of this section is that a computer
program is misleading if it displays these t-values when autocorrelation is present.

Denoting by $x_i$ the error term for year i, a test for first order autocorrelation is
based on the Durbin-Watson statistic [3, p. 276]:

$d = \frac{\sum_{i=2}^{24} (x_i - x_{i-1})^2}{\sum_{i=1}^{24} x_i^2} = .8852$    (2)

Since the number of observations is 24 and 4 independent variables are under
consideration, it follows from Table E of Frank's book [3, p. 276] that the $d_1$ and $d_2$ for a
two-tailed critical region of 5 per cent are given by

$d_1 = .91 \qquad d_2 = 1.66$    (3)

and that therefore

$d < d_1 < d_2$    (4)

concluding that there is significant first-order autocorrelation [3, p. 279].
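The Durbin-Watson statistic is the sum of squared successive differences of the error terms divided by the sum of their squares. The sketch below applies that definition to the 24 error terms tabulated above; the function name `durbin_watson` is a hypothetical label for this illustration, and the bounds are the $d_1$ and $d_2$ quoted in the text.

```python
def durbin_watson(errors):
    """Durbin-Watson d for a sequence of error terms in time order."""
    numerator = sum((b - a) ** 2 for a, b in zip(errors, errors[1:]))
    denominator = sum(e ** 2 for e in errors)
    return numerator / denominator

# Error terms of the untransformed regression, years 1 through 24.
errors = [-2.752, 5.711, -10.181, -24.006, -7.178, 10.891, 16.354, 16.145,
          6.122, 16.824, 10.024, -3.341, -18.427, -10.744, -1.236, -9.047,
          -11.037, 2.088, 5.111, 6.720, 9.988, -2.504, -0.979, -4.544]

d = durbin_watson(errors)
D1, D2 = 0.91, 1.66          # bounds quoted in the text for 24 observations
print(round(d, 4), d < D1)
```

A value of d well below 2 reflects error terms that tend to keep the same sign from one year to the next, which is visible in the long runs of positive and negative errors in the table.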



Transformation to a new set of variables: eliminating autocorrelation

The first step is to determine $\rho$, the coefficient of autocorrelation. The estimated
value of $\rho$ is given by Frank [3, pp. 280-281]:

$\hat{\rho} = \frac{\sum_{i=2}^{24} x_i x_{i-1}}{\sum_{i=2}^{24} x_{i-1}^2}$    (5)



An approximation to $\rho$ is given by

$\rho = \frac{1}{2}(2 - d) = .5574$    (6)

where d is the Durbin-Watson statistic.
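The approximation can be motivated by expanding the numerator of the Durbin-Watson statistic. The derivation below is a sketch under two assumptions that are not spelled out in the paper: the end terms $x_1^2$ and $x_n^2$ are small relative to the total sum of squares, and $\hat{\rho}$ is taken as the ratio of the lagged cross-product sum to the sum of squares (other common forms differ only in end terms).

```latex
\begin{align*}
d &= \frac{\sum_{i=2}^{n} (x_i - x_{i-1})^2}{\sum_{i=1}^{n} x_i^2}
   = \frac{\sum_{i=2}^{n} x_i^2 + \sum_{i=2}^{n} x_{i-1}^2
           - 2 \sum_{i=2}^{n} x_i x_{i-1}}{\sum_{i=1}^{n} x_i^2} \\
  &\approx 2 - 2\,\frac{\sum_{i=2}^{n} x_i x_{i-1}}{\sum_{i=1}^{n} x_i^2}
   = 2\,(1 - \hat{\rho}), \\
\hat{\rho} &\approx \tfrac{1}{2}\,(2 - d).
\end{align*}
```

Read in reverse, the reported value $\rho = .5574$ corresponds to $d \approx 2(1 - .5574) = .8852$.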

Denoting by $H_i^*$, $I_i^*$, $M_i^*$, $S_i^*$, $T_i^*$ a new set of variables, the transformation is given by
Frank [3, p. 280]:

$H_i^* = H_i - .5574\,H_{i-1}$
$I_i^* = I_i - .5574\,I_{i-1}$
$M_i^* = M_i - .5574\,M_{i-1}$
$S_i^* = S_i - .5574\,S_{i-1}$
$T_i^* = T_i - .5574\,T_{i-1}$    (7)

where $i = 2, 3, \ldots, 24$.
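Each transformed value is the current year's value minus .5574 times the previous year's value, so one observation is lost and 23 remain. The sketch below applies this to the first four years of company sales and of the time trend from the data table; the function name `transform` is a hypothetical label, not part of the authors' program.

```python
RHO = 0.5574   # estimated coefficient of autocorrelation from the text

def transform(series, rho=RHO):
    """Return x_i - rho * x_(i-1) for i = 2..n; the first year is dropped."""
    return [current - rho * previous
            for previous, current in zip(series, series[1:])]

# First four years (1947-1950) from the data table.
sales = [92.920, 122.440, 125.570, 110.460]
time = [1, 2, 3, 4]

sales_star = transform(sales)
time_star = transform(time)

print([round(v, 3) for v in sales_star])
print([round(v, 3) for v in time_star])
```

The same function would be applied to housing starts, disposable income, and new marriages before rerunning the regression on the transformed columns.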



Performing these transformations leads to the following set of columns:

 i      H*          I*          M*          S*          T*

 1      -           -           -           -           -
 2      527.303      80.931     714.022      70.647      1.443
 3      507.940      93.823     701.239      57.323      1.885
 4      562.217      82.244     570.568      40.469      2.328
 5      914.691     101.457     786.325      77.831      2.770
 6      347.604     110.189     665.833      76.320      3.213
 7      576.002     109.643     649.964      71.741      3.656
 8      527.706     116.914     688.178      64.391      4.098
 9      671.198     116.223     628.277      54.279      4.541
10      720.295     132.600     700.490      78.781      4.984
11      417.851     139.952     731.637      53.333      5.426
12      465.836     145.241     634.538      49.944      5.869
13      696.101     146.546     604.883      45.970      6.311
14      783.541     160.172     685.229      67.175      6.754
15      420.637     161.770     694.261      52.228      7.197
16      616.886     169.314     695.867      46.670      7.639
17      729.345     182.187     717.719      58.691      8.082
18      796.196     189.838     773.325      72.260      8.524
19      637.817     211.080     797.079      77.940      8.967
20      630.736     225.744     830.848      84.582      9.410
21      343.605     243.829     846.831      88.958      9.852
22      645.183     264.651     835.175      99.700     10.295
23      799.952     286.398     992.715     124.357     10.737
24      629.539     300.239     984.336     122.639     11.180



For these transformed variables the least square estimate is given by

$S^* = -43.879 + .023H^* + .713I^* + .088M^* - 12.709T^*$    (8)

In the transformed system the error terms are given as follows:








 i      S*          S* (ESTIMATED)      ERROR TERMS

 1      -           -                   -
 2       70.647      70.590                .057
 3       57.323      72.583             -15.260
 4       40.469      48.479              -8.010
 5       77.831      83.754              -5.923
 6       76.320      60.525              15.795
 7       71.741      58.447              13.294
 8       64.391      60.240               4.152
 9       54.279      52.205               2.074
10       78.781      65.755              13.026
11       53.333      61.051              -7.713
12       49.944      51.782              -1.838
13       45.970      49.856              -3.885
14       67.175      63.053               4.121
15       52.228      50.890               1.337
16       46.670      55.368              -8.698
17       58.691      63.470              -4.779
18       72.260      69.751               2.508
19       77.940      77.667                .273
20       84.582      85.304               -.722
21       88.958      87.279               1.679
22       99.700     106.915              -7.215
23      142.357     129.868              12.489
24      122.639     129.398              -6.759




For these error terms the Durbin-Watson statistic is given by

$d = 1.8015$    (9)

Since 23 observations and 4 independent variables are under consideration, one
establishes from Table E of Frank's book [3, p. 276] that

$d_1 = .89 \qquad d_2 = 1.67$    (10)

and therefore in the transformed system

$d > d_2 > d_1$    (11)

in which case one accepts the hypothesis of no first order autocorrelation [3, p. 279].
Therefore in the transformed system it is valid to use the t-values.
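The two comparisons of d with the tabulated bounds follow one rule, sketched below. The function names and the returned strings are hypothetical labels for this illustration. The rule shown covers only the lower side of the test (positive autocorrelation), which is the case at issue here; a full two-tailed test would also compare 4 - d with the same bounds. The error terms are those tabulated for the transformed system.

```python
def durbin_watson(errors):
    """Durbin-Watson d for a sequence of error terms in time order."""
    numerator = sum((b - a) ** 2 for a, b in zip(errors, errors[1:]))
    return numerator / sum(e ** 2 for e in errors)

def dw_decision(d, d1, d2):
    """Compare d with the lower bound d1 and upper bound d2."""
    if d < d1:
        return "first-order autocorrelation"
    if d > d2:
        return "no first-order autocorrelation"
    return "inconclusive"

# Error terms of the transformed regression, i = 2 through 24.
transformed_errors = [0.057, -15.260, -8.010, -5.923, 15.795, 13.294, 4.152,
                      2.074, 13.026, -7.713, -1.838, -3.885, 4.121, 1.337,
                      -8.698, -4.779, 2.508, 0.273, -0.722, 1.679, -7.215,
                      12.489, -6.759]

d_after = durbin_watson(transformed_errors)
before = dw_decision(0.8852, 0.91, 1.66)     # original system
after = dw_decision(d_after, 0.89, 1.67)     # transformed system
print(round(d_after, 4), before, "->", after)
```

Values of d falling between the two bounds leave the test undecided, which is why both bounds are reported.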



Significance of t-values

The transformed equation, with the corresponding t-values below the coefficients, is

$S^* = -43.879 + .023H^* + .713I^* + .088M^* - 12.709T^*$    (12)
       -1.952     1.694      2.959      1.658       3.237

Coefficient of Determination = .893

Standard Error of Estimate = 8.845

In order to determine the significance of the t-values one uses a Student's-t
distribution with (N-K) degrees of freedom. Since in the transformed system

N = number of observations = 23

K = total number of variables under consideration = 5

it follows that the number of degrees of freedom is 18. The critical value for a 5%
one-tailed test of significance using the Student's-t distribution with 18 degrees of freedom
is 1.729. The t-values associated with the constant term, disposable personal income and
time trend are significant at a 5% level. The t-values associated with new housing starts
and new marriages are significant at approximately a 5% level.
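The comparison in this paragraph can be written out directly. The sketch below uses the t-values and the critical value exactly as printed in the paper; the dictionary keys are hypothetical labels, and magnitudes are compared because the time-trend t-value is printed without a sign. A standard t table lists roughly 1.734 for 18 degrees of freedom, and the classification below is the same under either value.

```python
N_OBSERVATIONS = 23
N_VARIABLES = 5
degrees_of_freedom = N_OBSERVATIONS - N_VARIABLES

CRITICAL_VALUE = 1.729   # 5% one-tailed value quoted in the text

# t-values as printed beneath the transformed equation.
t_values = {
    "constant": -1.952,
    "housing starts": 1.694,
    "disposable income": 2.959,
    "new marriages": 1.658,
    "time trend": 3.237,
}

significant = [name for name, t in t_values.items() if abs(t) > CRITICAL_VALUE]
borderline = [name for name, t in t_values.items() if abs(t) <= CRITICAL_VALUE]

print(degrees_of_freedom, significant, borderline)
```

The two borderline coefficients fall just short of the critical value, which is what the text means by significance at approximately a 5% level.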



NOTES AND REFERENCES

1. Ya-Lun Chou, Statistical Analysis. Holt, Rinehart and Winston, Inc., 1969.

2. W. L. Hays and L. W. Winkler, Statistics, Vol. II. Holt, Rinehart and Winston, Inc.,
1970.

3. C. R. F. Frank, Jr., Statistics and Econometrics. Holt, Rinehart and Winston, Inc.,
1971.

4. G. C. Parker and E. L. Segura, How to get a better forecast. Harvard Business Review,
1971, March-April, 99-109.

*Please send correspondence to Dr. Munick.






